In this paper, we investigate the theoretical and empirical properties of L2 boosting with kernel 
regression estimates as weak learners. We show that each step of L2 boosting reduces the bias of 
the estimate by two orders of magnitude, while it does not deteriorate the order of the variance. 
We illustrate the theoretical findings by some simulated examples. Also, we demonstrate that 
L2 boosting is superior to the use of higher-order kernels, which is a well-known method of 
reducing the bias of the kernel estimate. 
1. Introduction 

In the last decade, several important approaches for classification and pattern recognition 
have been proposed with feasible computational algorithms in the machine learning 
community. Boosting is one of the most promising techniques that has recently received 
a great deal of attention from the statistical community. It was first proposed by 
Schapire (1990) as a means of improving the performance of a given method, called a 
weak learner. Subsequent investigations of the methods have been made in both communities. 
These include, among others, Freund (1995); Freund and Schapire (1996, 1997); 
Schapire, Freund, Bartlett and Lee (1998); Breiman (1998, 1999); Schapire and Singer 
(1999); Friedman, Hastie and Tibshirani (2000); Friedman (2001). 

Understanding boosting algorithms as functional gradient descent techniques gives 
theoretical justifications of the methods; see Mason, Baxter, Bartlett and Frean (2000) and 
Friedman (2001). It connects various boosting algorithms to statistical optimization 
problems with corresponding loss functions. For example, AdaBoost (Freund and Schapire 
(1996)) can be interpreted as giving an approximate solution, starting from an initial 
learner, to the problem of minimizing the exponential risk for classification. Also, 
LogitBoost corresponds to an approximate optimization of the log-likelihood of binary 
random variables; see Friedman et al. (2000). 

In this paper, we study boosting as a successful bias reduction method in nonparametric 
regression. Since the regression function $m(\cdot) = E(Y \mid X = \cdot)$ is the minimizer of the L2 
risk $E[Y - m(X)]^2$, it is natural to take the squared error loss as an objective function. 
Application of the functional gradient descent approach to the L2 risk is trivial since 
minimization of the L2 risk itself is a linear problem and thus there is no need to linearize 
it. In fact, the population version of L2 boost is nothing else than adding $m - m_0$ to an 
initial function $m_0$ so that a single update yields an exact solution. However, the empirical 
L2 boost is non-trivial. It amounts to repeated least-squares fitting of residuals. 

L2 boost in the context of regression has been studied by Friedman (2001) and 
Bühlmann and Yu (2003). In the latter work, the authors provided some expressions 
for the average squared bias and the average variance of the L2 boost estimate obtained 
from a linear smoother in terms of the eigenvalues of the corresponding smoother matrix. 
They also showed that, if the learner is a smoothing spline, it is possible for L2 
boosting to achieve the optimal rate of convergence for all higher-order smoothness of 
the regression function. In doing so, they took the iteration number, rather than the 
penalty constant, as the regularization parameter. The optimal rate is attained if one 
takes the iteration number $r = O(n^{2p/(2\nu+1)})$ as the sample size $n$ goes to infinity, where 
$p$ is the order of the smoothing spline learner and $\nu$ is the smoothness of the regression 
function. 

In this paper, we investigate the theoretical and empirical properties of L2 boosting 
when the learner is the Nadaraya-Watson kernel smoother. We derive the bias 
and variance properties of the estimate in terms of the bandwidth (smoothing parameter), 
which is more conventional in nonparametric function estimation. We show 
that the optimal rate of convergence is also achieved by the Nadaraya-Watson L2 
boosting for all smoothness of the regression function if the iteration number $r$ is 
high enough, depending on the smoothness $\nu$, and the bandwidth is properly chosen 
as $O(n^{-1/(2\nu+1)})$. In particular, we prove that each step of L2 boosting reduces 
the bias of the estimate by two orders of the bandwidth, and also that additional 
boosting steps do not deteriorate the order of the variance. We illustrate these 
theoretical findings by some simulated examples in a numerical study. Also, we compare 
the finite sample properties of L2 boosting with those of higher-order kernel 
smoothing, the latter being a well-known method of reducing the bias of the estimate. 
Our results suggest that L2 boosting is superior to the use of higher-order kernels. 

2. Main results 

The L2 boosting algorithm is derived from application of the functional gradient descent 
technique to the L2 loss. The task of the latter is to find the function $m$ that minimizes 
a functional $\psi(m)$. With an initial function $m_0$, one searches the best direction $\delta$ such 
that $\psi(m_0 + \epsilon\delta)$ is minimized. Let $\dot\psi(\delta)$ be the Gâteaux differential of $\psi$ with increment 
$\delta$, that is, 

$$\dot\psi(\delta)(m) = \lim_{\epsilon \to 0} \frac{\psi(m + \epsilon\delta) - \psi(m)}{\epsilon}.$$

To first order in $\epsilon$, minimizing $\psi(m_0 + \epsilon\delta)$ with respect to $\delta$ is equivalent to minimizing 
$\dot\psi(\delta)(m_0)$. Let $\delta_1$ denote the minimizer. The update of the initial $m_0$ is given by 
$m_1 = m_0 + \epsilon_1\delta_1$, where $\epsilon_1$ minimizes $\psi(m_0 + \epsilon\delta_1)$. Then, the process is iterated. 

Let $m(x) = E(Y \mid X = x)$ be the regression function. If one applies the functional gradient 
descent technique to the L2 loss, $\psi(m) = \tfrac{1}{2}E[Y - m(X)]^2$, then one gets 

$$\dot\psi(\delta)(m_0) = -E[\delta(X)(Y - m_0(X))].$$

Since minimizing $-E[\delta(X)(Y - m_0(X))]$ subject to $E\delta(X)^2 = c$ for some constant $c > 0$ 
is equivalent to minimizing $E[Y - m_0(X) - \delta(X)]^2$, it follows that the updated function 
$m_1$ is given by $m_1 = m_0 + \delta_1$, where 

$$\delta_1 = \arg\min_{\delta} E[Y - m_0(X) - \delta(X)]^2 = E[Y - m_0(X) \mid X = \cdot].$$

Thus, with the L2 loss, the update $m_1$ equals the true function $m$: 

$$m_1(x) = m_0(x) + E[Y - m_0(X) \mid X = x] = m(x).$$

The L2 boosting algorithm given below is an empirical version of the updating procedure 
above. 
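
For readers who wish to check the differential, the expansion behind it can be sketched as follows. This is only a restatement of the argument in the notation of this section and assumes that $\delta(X)$ and $Y - m_0(X)$ have finite second moments.

```latex
\begin{align*}
\psi(m_0 + \epsilon\delta)
  &= \tfrac{1}{2} E[Y - m_0(X) - \epsilon\delta(X)]^2 \\
  &= \psi(m_0) - \epsilon\, E[\delta(X)(Y - m_0(X))]
     + \tfrac{1}{2}\epsilon^2 E\delta(X)^2, \\
\dot\psi(\delta)(m_0)
  &= \lim_{\epsilon \to 0} \frac{\psi(m_0 + \epsilon\delta) - \psi(m_0)}{\epsilon}
   = -E[\delta(X)(Y - m_0(X))], \\
E[Y - m_0(X) - \delta(X)]^2
  &= E[Y - m_0(X)]^2 - 2E[\delta(X)(Y - m_0(X))] + c
  \quad \text{when } E\delta(X)^2 = c .
\end{align*}
```

The last line shows why, under the constraint $E\delta(X)^2 = c$, minimizing $-E[\delta(X)(Y - m_0(X))]$ and minimizing $E[Y - m_0(X) - \delta(X)]^2$ select the same direction.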

Algorithm (L2 Boosting). 

Step 1 (Initialization): Given a sample $\mathcal{S} = \{(X_i, Y_i), i = 1, \ldots, n\}$, fit an initial 
estimate $\hat m_0(x) = \hat m(x; \mathcal{S})$ to the data. 

Step 2 (Iteration): Repeat for $r = 1, \ldots, R$. 

(i) Compute the residuals $e_i = Y_i - \hat m_{r-1}(X_i)$, $i = 1, \ldots, n$. 

(ii) Fit an estimate $\hat m(x; \mathcal{S}_e)$ to the data $\mathcal{S}_e = \{(X_i, e_i), i = 1, \ldots, n\}$. 

(iii) Update $\hat m_r(x) = \hat m_{r-1}(x) + \hat m(x; \mathcal{S}_e)$. 
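
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used for the results in this paper: the function names `nw_fit` and `l2_boost`, the Gaussian kernel and the bandwidth argument `h` are assumptions made for the example, and any other learner with the same `fit(x, y)` interface could be substituted.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nw_fit(x_train, y_train, h):
    """Hypothetical helper: return a Nadaraya-Watson predictor fitted to the data."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    def predict(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        k = gaussian_kernel((x_train[None, :] - x[:, None]) / h) / h
        return (k @ y_train) / k.sum(axis=1)

    return predict

def l2_boost(x_train, y_train, fit, n_steps):
    """Run Step 1 and n_steps rounds of Step 2; return a predictor for any r <= n_steps."""
    y_train = np.asarray(y_train, dtype=float)
    learners = [fit(x_train, y_train)]            # Step 1: initial estimate
    fitted = learners[0](x_train)
    for _ in range(n_steps):
        residuals = y_train - fitted              # Step 2 (i)
        learner = fit(x_train, residuals)         # Step 2 (ii)
        learners.append(learner)
        fitted = fitted + learner(x_train)        # Step 2 (iii)

    def predict(x, r=n_steps):
        # The r-step estimate is the sum of the first r + 1 fitted learners.
        return sum(learner(x) for learner in learners[: r + 1])

    return predict
```

A call such as `l2_boost(x, y, lambda xs, ys: nw_fit(xs, ys, 0.1), 3)` returns a function whose optional argument `r` selects the number of boosting steps to apply.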

Thus, L2 boosting is simply repeated least-squares fitting of residuals. With $r = 1$ 
(one-step boosting), it was already proposed by Tukey (1977) and is usually referred 
to as "twicing". Twicing is related to using higher-order kernels. It was observed by 
Stützle and Mittal (1979) that, in the case of the fixed equispaced design points $x_i = 
i/n$, twicing a kernel smoother is asymptotically equivalent to directly using a higher-order 
kernel. To be more specific, let $K$ be a kernel function, $h > 0$ be the bandwidth 
and $K_h(u) = K(u/h)/h$. Define $K^* = 2K - (K * K)$, where $*$ denotes the convolution 
operator. Note that $K^*$ is a higher-order kernel. If $\hat m_0(x) = n^{-1}\sum_{i=1}^n K_h(x - x_i)Y_i$, then 
$\hat m_1(x) \simeq n^{-1}\sum_{i=1}^n K^*_h(x - x_i)Y_i$, where $\simeq$ is due to the integral approximation error 

$$n^{-1}\sum_{j=1}^n K_h(x - x_j)K_h(x_j - x_i) \simeq \int K_h(x - z)K_h(z - x_i)\,dz = (K * K)_h(x - x_i).$$
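
That $K^* = 2K - (K * K)$ is a higher-order kernel can be checked numerically. The sketch below takes the Gaussian kernel as an example and approximates the convolution and the moments by Riemann sums; the grid and the helper name `moment` are choices made for this illustration only.

```python
import numpy as np

du = 0.01
u = np.arange(-1000, 1001) * du                 # symmetric grid on [-10, 10], odd length
K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
KK = np.convolve(K, K, mode="same") * du        # Riemann-sum approximation of (K * K)(u)
K_star = 2.0 * K - KK                           # twicing kernel K* = 2K - K*K

def moment(values, order):
    """Approximate the integral of u**order * values(u) du on the grid."""
    return float(np.sum(u**order * values) * du)

m0, m2, m4 = (moment(K_star, order) for order in (0, 2, 4))
```

For the Gaussian kernel, `m0` is close to 1 and `m2` is close to 0 while `m4` is not, so $K^*$ integrates to one and has a vanishing second moment, which is the defining property of a fourth-order kernel.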

In this paper, we consider random covariates $X_i$. We derive the theoretical properties 
of L2 boosting when the learner is the Nadaraya-Watson kernel smoother, that is, 

$$\hat m(x; \mathcal{S}) = \frac{\sum_{i=1}^n K_h(X_i - x)Y_i}{\sum_{i=1}^n K_h(X_i - x)}.$$

The Nadaraya-Watson smoothing is the simplest and numerically most stable technique 
of local kernel regression. We note that statistical properties of L2 boosting for $r \ge 1$ 
with Nadaraya-Watson smoothing have not been investigated before. 

Throughout the paper, we assume $K$ is a symmetric probability density function which 
is Lipschitz continuous and is supported on $[-1, 1]$. The bounded support condition for 
$K$ can be relaxed to include kernels, such as Gaussian, that decrease to zero at tails with 
an exponential rate. Below, we discuss the asymptotic properties of $\hat m_r$ for $r \ge 1$. For 
this, we assume that $h \to 0$ and $nh/\log n \to \infty$ as $n \to \infty$. 

We denote a pre-estimate of $m$ by $\tilde m$. Thus, at the $r$th iteration $\tilde m = \hat m_{r-1}$. Let $\hat m$ be 
its update defined by 

$$\hat m(x) = \tilde m(x) + \sum_{i=1}^n W_i(x)[Y_i - \tilde m(X_i)], \tag{1}$$

where $W_i(x) = [\sum_{j=1}^n K_h(X_j - x)]^{-1} K_h(X_i - x)$. At the $r$th iteration $\hat m = \hat m_r$. Note that, 
for the initial estimate $\hat m_0(x) = \sum_{j=1}^n W_j(x)Y_j$, we get 

$$\hat m_0(x) - m(x) = \sum_{j=1}^n W_j(x)e_j + \sum_{j=1}^n W_j(x)[m(X_j) - m(x)],$$

where $e_j = Y_j - m(X_j)$. 

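Let $\tilde w_j$ be the weight functions for $\tilde m$ that depend solely on $X_1, \ldots, X_n$ and satisfy 

$$\sum_{j=1}^n \tilde w_j(x) = 1 \ \text{for all } x, \qquad \tilde m(x) = \sum_{j=1}^n \tilde w_j(x)Y_j.$$

Define the updated weight functions by 

$$\hat w_j(x) = \tilde w_j(x) + W_j(x) - \sum_{i=1}^n W_i(x)\tilde w_j(X_i).$$

Note that $\hat w_j$ also depends solely on $X_1, \ldots, X_n$. One can verify 

$$\sum_{j=1}^n \hat w_j(x) = 1 \ \text{for all } x, \qquad \hat m(x) = \sum_{j=1}^n \hat w_j(x)Y_j,$$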
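
At the design points the weight recursion is a matrix identity, which the following sketch checks numerically. It is an illustration under stated assumptions: a Gaussian kernel is used, `S[k, i]` stores $W_i(X_k)$, and the function names are hypothetical.

```python
import numpy as np

def nw_weight_matrix(x, h):
    """Row k holds W_1(X_k), ..., W_n(X_k) for a Gaussian kernel."""
    k = np.exp(-0.5 * ((x[None, :] - x[:, None]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def boosted_weights(S, n_steps):
    """Iterate w <- w + S - S w, starting from the weights w = S of the initial estimate."""
    w = S.copy()
    for _ in range(n_steps):
        w = w + S - S @ w
    return w
```

Since $I - (w + S - Sw) = (I - S)(I - w)$, the weights after $r$ steps equal $I - (I - S)^{r+1}$ at the design points, and each row sums to one because the rows of $S$ do.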

so that 

$$\hat m(x) - m(x) = \sum_{j=1}^n \hat w_j(x)e_j + \sum_{j=1}^n \hat w_j(x)[m(X_j) - m(x)]. \tag{2}$$

From (2), we note that $\mathrm{Var}(\hat m(x) \mid X_1, \ldots, X_n) = \sum_{j=1}^n \hat w_j(x)^2\sigma^2(X_j)$, where $\sigma^2(x) = 
\mathrm{Var}(Y \mid X = x)$. The following theorem provides the magnitude of the conditional variance. 
Let $f$ denote the marginal density of the covariate $X_1$. We assume that $f$ is supported 
on $\mathcal{I} = [0, 1]$. 

Theorem 1. Assume that $f$ is continuous on $\mathcal{I}$ and $\inf_{x \in \mathcal{I}} f(x) > 0$. If $\sum_{j=1}^n \tilde w_j(x)^2 = 
O_p(n^{-1}h^{-1})$ uniformly for $x \in \mathcal{I}$, then $\sum_{j=1}^n \hat w_j(x)^2 = O_p(n^{-1}h^{-1})$ uniformly for 
$x \in \mathcal{I}$. 

In the proof of Theorem 1 given below, we prove that $\sum_{j=1}^n W_j(x)^2 = O_p(n^{-1}h^{-1})$ 
uniformly for $x \in \mathcal{I}$. Thus, the weight functions $W_j$ for the initial estimate $\hat m_0$ satisfy the 
condition of Theorem 1. This means that L2 boosting does not deteriorate the order of 
the variance of the estimate as the iteration goes on. 

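Proof of Theorem 1. It follows that 

$$
\begin{aligned}
\sum_{j=1}^n \hat w_j(x)^2
  &\le 3\sum_{j=1}^n \tilde w_j(x)^2 + 3\sum_{j=1}^n W_j(x)^2
     + 3\sum_{j=1}^n \Big[\sum_{i=1}^n W_i(x)\tilde w_j(X_i)\Big]^2 \\
  &\le 3\sum_{j=1}^n W_j(x)^2 + 3\sup_{x \in \mathcal{I}}\sum_{j=1}^n \tilde w_j(x)^2
     + 3\sum_{i=1}^n W_i(x)\sum_{j=1}^n \tilde w_j(X_i)^2 \\
  &\le 3\sum_{j=1}^n W_j(x)^2 + 6\sup_{x \in \mathcal{I}}\sum_{j=1}^n \tilde w_j(x)^2.
\end{aligned}
$$

To complete the proof, it remains to show that $\sum_{j=1}^n W_j(x)^2 = O_p(n^{-1}h^{-1})$ uniformly for 
$x \in \mathcal{I}$. Let $\mathcal{I}_h = [h, 1 - h]$. Then, 

$$
n^{-1}\sum_{i=1}^n K_h(X_i - x) =
\begin{cases}
f(x) + o_p(1), & \text{uniformly for } x \in \mathcal{I}_h, \\
f(x)C_1(x) + o_p(1), & \text{uniformly for } x \in \mathcal{I} \setminus \mathcal{I}_h,
\end{cases}
$$

$$
n^{-1}h\sum_{i=1}^n [K_h(X_i - x)]^2 =
\begin{cases}
f(x)C_2 + o_p(1), & \text{uniformly for } x \in \mathcal{I}_h, \\
f(x)C_3(x) + o_p(1), & \text{uniformly for } x \in \mathcal{I} \setminus \mathcal{I}_h,
\end{cases}
$$

where $1/2 \le C_1(x) \le 1$, $C_2 = \int K^2$ and $\tfrac{1}{2}\int K^2 \le C_3(x) \le \int K^2$. From this, we conclude 

$$\sum_{j=1}^n W_j(x)^2 = O_p(n^{-1}h^{-1})$$

uniformly for $x \in \mathcal{I}$. 

□ 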
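
The inequality that opens the proof of Theorem 1 can be checked numerically at the design points. The sketch below is an illustration only: the sample size, the bandwidth, the uniform design and the Gaussian kernel are assumptions made for the example, and the supremum over $x$ is replaced by a maximum over the observed $X_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 400, 0.05
x = rng.uniform(size=n)

# Nadaraya-Watson weights at the design points: S[k, i] = W_i(X_k).
k = np.exp(-0.5 * ((x[None, :] - x[:, None]) / h) ** 2)
S = k / k.sum(axis=1, keepdims=True)

def sup_sq_weights(w):
    """Largest value over design points of the sum over j of w_j(X_k)^2."""
    return float(np.max(np.sum(w**2, axis=1)))

w = S.copy()
scaled = [n * h * sup_sq_weights(w)]      # r = 0: weights of the initial estimate
for _ in range(4):
    w = w + S - S @ w                     # weight update for one boosting step
    scaled.append(n * h * sup_sq_weights(w))
```

Each entry of `scaled` is the sum of squared weights multiplied by $nh$. The bound in the proof implies `scaled[r + 1] <= 3 * scaled[0] + 6 * scaled[r]`, so the entries cannot grow faster than geometrically in $r$ and stay of order one in $n$ for a fixed number of steps.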



Next, we discuss the conditional bias of the update $\hat m$. The conditional biases of $\tilde m$ 
and $\hat m$ equal $\sum_{j=1}^n \tilde w_j(x)[m(X_j) - m(x)]$ and $\sum_{j=1}^n \hat w_j(x)[m(X_j) - m(x)]$, respectively. 
In the case where the pre-estimate $\tilde m$ is the initial estimate $\hat m_0$, we have $\tilde w_j = W_j$ and 

$$\sum_{j=1}^n W_j(x)[m(X_j) - m(x)] = \frac{E K_h(X_1 - x)[m(X_1) - m(x)]}{E K_h(X_1 - x)} + O_p\big(\sqrt{n^{-1}h\log n}\big)$$

uniformly for $x \in \mathcal{I}_\epsilon$ for arbitrarily small $\epsilon > 0$, sufficient smoothness of $m$ and $f$ permitting. 
The $O_p(\sqrt{n^{-1}h\log n})$ term in the above expansion comes from the mean zero stochastic 
terms in the numerator and denominator of the left-hand side. 

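Theorem 2. Assume that $f$ is continuously differentiable on $\mathcal{I}^{\circ} = (0, 1)$ and $\inf_{x \in \mathcal{I}} f(x) > 0$. 
Let $r \ge 1$ be an integer. Suppose that 

$$\sum_{j=1}^n \tilde w_j(x)[m(X_j) - m(x)] = h^{2r}\alpha_n(x) + O_p\big(\sqrt{n^{-1}h\log n}\big) \tag{3}$$

uniformly for $x \in \mathcal{I}_\epsilon$ for arbitrarily small $\epsilon > 0$, where $\alpha_n$ is a sequence of functions that 
are twice differentiable on $\mathcal{I}^{\circ}$ and satisfies 

$$\lim_{\delta \to 0}\limsup_{n \to \infty}\sup_{|u - v| \le \delta} |\alpha_n^{(k)}(u) - \alpha_n^{(k)}(v)| = 0 \tag{4}$$

for $k = 0, 1, 2$. Then, 

$$\sum_{j=1}^n \hat w_j(x)[m(X_j) - m(x)] = h^{2(r+1)}\beta_n(x) + O_p\big(\sqrt{n^{-1}h\log n}\big)$$

uniformly for $x \in \mathcal{I}_\epsilon$ for arbitrarily small $\epsilon > 0$, where $\beta_n(x)$ is a deterministic sequence 
such that 

$$\beta_n(x) = -\frac{\alpha_n''(x)f(x) + 2\alpha_n'(x)f'(x)}{2f(x)}\int u^2K(u)\,du + o(1).$$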



Theorem 2 tells us that each step of L2 boosting improves the asymptotic bias of the 
estimate by two orders of the bandwidth, from $h^{2r}$ to $h^{2(r+1)}$, if $m$ and $f$ are sufficiently 
smooth. When $\tilde m = \hat m_0$, 

$$\alpha_n(x) = h^{-2}\frac{E K_h(X_1 - x)[m(X_1) - m(x)]}{E K_h(X_1 - x)}
  = \frac{m''(x)f(x) + 2m'(x)f'(x)}{2f(x)}\int u^2K(u)\,du + o(1), \tag{5}$$

which can be shown to satisfy (4), sufficient smoothness of $m$ and $f$ permitting. In 
general, if $m$ and $f$ are sufficiently smooth, the corresponding sequence of the functions 
$\alpha_n$ at each step of the iteration satisfies (4). 






For the function class 

$$\mathcal{F}(\nu, C) = \{m : |m^{(\lfloor\nu\rfloor)}(x) - m^{(\lfloor\nu\rfloor)}(x')| \le C|x - x'|^{\nu - \lfloor\nu\rfloor} \text{ for all } x, x' \in \mathcal{I}\},$$

where $\lfloor\nu\rfloor$ is the largest integer that is less than $\nu$, it is known that the minimax optimal 
rate of convergence for estimating $m$ equals $n^{-\nu/(2\nu+1)}$. Let $\hat m_r$ denote the estimate 
updated at the $r$th iteration. The following theorem implies that the L2 boosted 
Nadaraya-Watson estimate is minimax optimal if the iteration number $r$ is high enough 
and the bandwidth is chosen appropriately. 



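Theorem 3. Assume that $m \in \mathcal{F}(\nu, C_1)$, $f \in \mathcal{F}(\nu - 1, C_2)$ for $\nu \ge 2$ and $\inf_{x \in \mathcal{I}} f(x) > 0$. 
Let $r \ge \lfloor\nu/2\rfloor$ be an integer. Then, 

$$E[\hat m_r(x) \mid X_1, \ldots, X_n] - m(x) = O_p\big(h^{\nu} + \sqrt{n^{-1}h\log n}\big) \tag{6}$$

uniformly for $x \in \mathcal{I}_\epsilon$ for arbitrarily small $\epsilon > 0$. 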



Theorems 1 and 3 imply that 

$$E[(\hat m_r(x) - m(x))^2 \mid X_1, \ldots, X_n] = O_p(n^{-1}h^{-1} + h^{2\nu})$$

for $r \ge \lfloor\nu/2\rfloor$. Thus, if one takes $h = O(n^{-1/(2\nu+1)})$ and $r \ge \lfloor\nu/2\rfloor$, then $\hat m_r$ achieves the 
minimax optimal rate of convergence. We note that Bühlmann and Yu (2003) obtained 
similar results for smoothing spline learners. They took the iteration number as the 
regularization parameter and held the penalty constant fixed. In the case of the cubic 
smoothing spline learner, for example, they showed that if $r = O(n^{4/(2\nu+1)})$, then the $r$th 
updated estimate achieves the optimal rate; see their Theorem 3. 
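
The arithmetic behind this choice can be sketched as follows. The function name is hypothetical, all constants are set to one, and $\lfloor\nu/2\rfloor$ is computed as the largest integer strictly less than $\nu/2$, in line with the convention stated for the function class.

```python
import math

def minimal_steps_and_bandwidth(nu, n):
    """Return (smallest admissible r, bandwidth n^(-1/(2 nu + 1)), resulting MSE order)."""
    if nu < 2:
        raise ValueError("the theory above assumes nu >= 2")
    r_min = math.ceil(nu / 2) - 1            # largest integer strictly less than nu / 2
    h = n ** (-1.0 / (2.0 * nu + 1.0))       # bandwidth balancing variance and squared bias
    mse_order = 1.0 / (n * h) + h ** (2.0 * nu)
    return r_min, h, mse_order
```

With this bandwidth both terms of the bound equal $n^{-2\nu/(2\nu+1)}$, so the returned order is twice that quantity. For $\nu = 2$ the initial Nadaraya-Watson estimate ($r = 0$) already suffices, whereas $\nu = 4$ requires at least one boosting step.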



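Proof of Theorem 2. Fix $\epsilon > 0$. Then, for sufficiently large $n$, all $X_i$ with $\sup_{x \in \mathcal{I}_\epsilon} W_i(x) > 0$ 
lie in $\mathcal{I}_{\epsilon/2}$. Thus, the expansion (3) holds if we replace $x$ by a random $X_i$ with 
$\sup_{x \in \mathcal{I}_\epsilon} W_i(x) > 0$. This implies that, uniformly for $x \in \mathcal{I}_\epsilon$, 

$$\sum_{i=1}^n W_i(x)\Big[\sum_{j=1}^n \tilde w_j(X_i)(m(X_j) - m(X_i))\Big] = h^{2r}\sum_{i=1}^n W_i(x)\alpha_n(X_i) + O_p(\rho_n), \tag{7}$$

where $\rho_n = \sqrt{n^{-1}h\log n}$. From (3) and (7), we have 

$$
\begin{aligned}
\sum_{j=1}^n \hat w_j(x)[m(X_j) - m(x)]
  &= \sum_{j=1}^n \tilde w_j(x)(m(X_j) - m(x))
     - \sum_{i=1}^n W_i(x)\Big[\sum_{j=1}^n \tilde w_j(X_i)(m(X_j) - m(X_i))\Big] \\
  &= h^{2r}\sum_{i=1}^n W_i(x)[\alpha_n(x) - \alpha_n(X_i)] + O_p(\rho_n)
\end{aligned}
\tag{8}
$$

uniformly for $x \in \mathcal{I}_\epsilon$. Define $\gamma_n(u, x) = [\alpha_n(x) - \alpha_n(u)]f(u)$. Then, 

$$
\begin{aligned}
n^{-1}\sum_{i=1}^n K_h(X_i - x)[\alpha_n(x) - \alpha_n(X_i)]
  &= \int K_h(u - x)\gamma_n(u, x)\,du + O_p(\rho_n) \\
  &= \tfrac{1}{2}h^2\gamma_n''(x, x)\int u^2K(u)\,du + r_n(x) + O_p(\rho_n)
\end{aligned}
$$

uniformly for $x \in \mathcal{I}_\epsilon$, where 

$$r_n(x) = h^2\int_{-1}^{1}\int_0^1 u^2K(u)[\gamma_n''(x - huv, x) - \gamma_n''(x, x)](1 - v)\,dv\,du$$

and $\gamma_n''(u, x) = \partial^2\gamma_n(u, x)/\partial u^2$. Note that $\gamma_n''(x, x) = -[\alpha_n''(x)f(x) + 2\alpha_n'(x)f'(x)]$ and 
that, for any $\delta > 0$, 

$$\limsup_{n \to \infty}\sup_{x \in \mathcal{I}_\epsilon} h^{-2}|r_n(x)| \le \limsup_{n \to \infty}\sup_{x \in \mathcal{I}_\epsilon}\sup_{|u| \le \delta} |\gamma_n''(x - u, x) - \gamma_n''(x, x)|\int u^2K(u)\,du.$$

Thus, from (4) we obtain $r_n(x) = o(h^2)$ uniformly for $x \in \mathcal{I}_\epsilon$. Since $n^{-1}\sum_{i=1}^n K_h(X_i - x) = 
f(x) + o_p(1)$ uniformly for $x \in \mathcal{I}_\epsilon$, we complete the proof of Theorem 2. □ 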

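Proof of Theorem 3. Let $p = \lfloor\nu/2\rfloor$. When $p = 0$ ($\nu = 2$), we know 

$$E(\hat m_0(x) \mid X_1, \ldots, X_n) - m(x) = h^2\alpha_n(x) + O_p(\rho_n),$$

where $\alpha_n$ is given at (5). When $p \ge 1$ ($\nu > 2$), one can verify by repeated applications of 
Theorem 2 that 

$$E[\hat m_{p-1}(x) \mid X_1, \ldots, X_n] - m(x) = h^{2p}\alpha_n(x) + O_p(\rho_n)$$

uniformly for $x \in \mathcal{I}_\epsilon$ for arbitrarily small $\epsilon > 0$, where $\alpha_n$ is a sequence of functions. If 
$\nu = 2p + \xi$ for some integer $p \ge 1$ and $0 < \xi \le 1$, then $\alpha_n$ satisfies 

$$\limsup_{n \to \infty}\sup_{|u - v| \le \delta} |\alpha_n(u) - \alpha_n(v)| \le C_1\delta^{\xi}$$

for some $C_1 > 0$. Since 

$$E[\hat m_p(x) \mid X_1, \ldots, X_n] - m(x) = h^{2p}\sum_{i=1}^n W_i(x)[\alpha_n(x) - \alpha_n(X_i)] + O_p(\rho_n) \tag{9}$$

uniformly for $x \in \mathcal{I}_\epsilon$ as in (8), we obtain (6). 
Next, if $\nu = 2p + 1 + \xi$ for some integer $p \ge 1$ and $0 < \xi \le 1$, then 

$$\limsup_{n \to \infty}\sup_{|u - v| \le \delta} |\alpha_n'(u) - \alpha_n'(v)| \le C_2\delta^{\xi}$$

for some $C_2 > 0$. Note that 

$$
\begin{aligned}
\Big|n^{-1}\sum_{i=1}^n K_h(X_i - x)[\alpha_n(x) + \alpha_n'(x)(X_i - x) - \alpha_n(X_i)]\Big|
  &\le C_2h^{\xi}n^{-1}\sum_{i=1}^n |X_i - x|K_h(X_i - x) \\
  &= O_p(h^{1+\xi})
\end{aligned}
\tag{10}
$$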



uniformly for $x \in \mathcal{I}_\epsilon$. From (9) and (10), we obtain (6) in the case $\nu = 2p + 1 + \xi$, too. 
This completes the proof of Theorem 3. □ 

Two important issues that need particular attention are the choice of the bandwidth 
$h$ and that of the iteration number $r$, which may have substantial influence on the 
performance of the estimator for a finite sample size. These are related to each other in 
the sense that both $h$ and $r$ are regularization parameters and interact with each other. An 
optimal choice for one of them depends on the choice of the other. In their smoothing 
spline approach, Bühlmann and Yu (2003) fixed the penalty constant, whose role 
is the same as that of the bandwidth $h$ in our setting, and found the optimal rate of 
increase for $r$ (as the sample size grows), as given in the above paragraph. Our theory works 
the other way around. It suggests that taking sufficiently large $r$ so that $r \ge \lfloor\nu/2\rfloor$, 
but fixed without tending to infinity as the sample size grows, gives an optimal 
performance in terms of rate of convergence if the bandwidth $h$ is chosen in an optimal 
way. 

In practical implementation of the boosting algorithm where the sample size is fixed, 
letting $r \to \infty$ alone leads to overfitting and thus jeopardizes the boosting method. 
One may think it is possible to avoid overfitting by increasing the bandwidth. However, 
increasing the bandwidth to reduce the variance of the estimator would also 
increase the bias, which may result in an increase of the mean squared error if $r$ is 
too high. Thus, one should use a data-dependent stopping rule for the iteration, as 
well as a data-driven bandwidth selector. For this one may employ a cross-validatory 
criterion, or the test bed method, as discussed in Györfi, Kohler, Krzyzak and Walk 
(2002) and Bickel, Ritov and Zakai (2006). To describe the latter method for selection 
of both $h$ and $r$, write $\hat m_r(\cdot; h)$ rather than $\hat m_r$ to stress its dependence on $h$, and let 
$\{(X_{n+1}, Y_{n+1}), \ldots, (X_{n+B}, Y_{n+B})\}$ be a test bed sample that is independent of the training 
sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Define, for each $r \ge 1$, 

$$\hat h_r = \arg\min_{h > 0}\sum_{i=n+1}^{n+B} [Y_i - \hat m_r(X_i; h)]^2.$$

Then one can take $\hat r$ for a stopping rule defined by 

$$\hat r = \arg\min_{r \ge 1}\sum_{i=n+1}^{n+B} [Y_i - \hat m_r(X_i; \hat h_r)]^2$$

and the data-driven bandwidth $\hat h_{\hat r}$. It would be of interest to see whether the regression 
estimator with these data-driven choices $\hat r$ and $\hat h_{\hat r}$ achieves the minimax optimal rate 
without $\nu$, the smoothness of the underlying function, being specified. We leave this as 
an open problem. 
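
A test bed selection of this kind might be sketched as follows. The sketch assumes that the criterion is the squared prediction error on the test bed sample, that the bandwidth is searched over a finite candidate grid rather than over all $h > 0$, and that the kernel is Gaussian; the function names are hypothetical.

```python
import numpy as np

def nw_weights(x_eval, x_train, h):
    """Nadaraya-Watson weights W_i(x) with a Gaussian kernel, one row per evaluation point."""
    k = np.exp(-0.5 * ((x_train[None, :] - x_eval[:, None]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def select_by_test_bed(x, y, x_bed, y_bed, bandwidths, r_max):
    """Return (r_hat, h_hat, err), where err[i, r] is the test bed squared error
    of the r-step boosted estimate with bandwidth bandwidths[i]."""
    err = np.empty((len(bandwidths), r_max + 1))
    for i, h in enumerate(bandwidths):
        S, S_bed = nw_weights(x, x, h), nw_weights(x_bed, x, h)
        fit, pred = S @ y, S_bed @ y                 # initial estimate (r = 0)
        err[i, 0] = np.sum((y_bed - pred) ** 2)
        for r in range(1, r_max + 1):
            resid = y - fit                          # residuals on the training sample
            fit, pred = fit + S @ resid, pred + S_bed @ resid
            err[i, r] = np.sum((y_bed - pred) ** 2)
    # Minimizing jointly over (h, r >= 1) gives the same pair as minimizing over h
    # for each r and then over r.
    sub = err[:, 1:]
    i_hat, r_off = np.unravel_index(np.argmin(sub), sub.shape)
    return int(r_off) + 1, float(bandwidths[i_hat]), err
```

Very small candidate bandwidths can make all kernel weights underflow at a test point that is far from the training sample, so the candidate grid should be chosen with the sample size in mind.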

A method based on a cross-validatory criterion can be described similarly. As 
an alternative to these methods that are based on estimation of the prediction 
error, one may estimate the mean squared error of the estimator $\hat m_r(\cdot; h)$ and 
then choose $h$ and $r$ that minimize the estimated mean squared error. There have 
been many proposals for estimating the mean squared errors of kernel-based estimators 
of the regression function in connection with bandwidth selection; see, for 
example, Ruppert, Sheather and Wand (1995) and Section 4.3 of Fan and Gijbels 
(1996). 



3. Numerical properties 

In this section, we present the finite sample properties of the L2 boosting estimates. To 
see how L2 boosting compares favorably to the use of higher-order kernels as a method 
of bias reduction, we consider 

$$\hat m_r^{*}(x) = \frac{\sum_{i=1}^n K_h^{[r]}(X_i - x)Y_i}{\sum_{i=1}^n K_h^{[r]}(X_i - x)},$$

where $K^{[r]}$ is a $2(r + 1)$th-order kernel defined by, with $K^{[0]} = K$, 

$$K^{[r]}(x) = 2K^{[r-1]}(x) - K^{[r-1]} * K^{[r-1]}(x).$$

Sufficient smoothness of $m$ and $f$ permitting, $\hat m_r^{*}$ is known to have a bias of order $h^{2(r+1)}$, 
which is of the same magnitude as the bias of the $r$-step boosted estimate $\hat m_r$. 
The simulation was done under the following two models: 

(1) $m(x) = \sin(2\pi x)$, $0 \le x \le 1$; 

(2) $m(x) = \tfrac{1}{2}\{3\sin(4\pi x) + 2\sin(3\pi x)\}$, $0 \le x \le 1$. 

We took $U(0, 1)$ for the distribution of $X_i$, and $N(0, 0.5^2)$ for the errors. For each 
model, two hundred pseudo-samples of size $n = 100$ and $400$ were generated. We used 
the Gaussian kernel $K$. We evaluated the mean integrated squared errors (MISE) of the 
estimates based on these samples. For this, we took 101 equally spaced grid points on 
$[0, 1]$ and used the trapezoidal rule for the numerical integration. 
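
The simulation design for model (1) might be reproduced along the following lines. This is a sketch under stated assumptions and not the code used for the results reported here: the function names, the random number generator and its seed are choices made for the example, and the Gaussian kernel is used without any further scaling of the bandwidth.

```python
import numpy as np

def nw_weights(x_eval, x_train, h):
    """Nadaraya-Watson weights W_i(x) with a Gaussian kernel, one row per evaluation point."""
    k = np.exp(-0.5 * ((x_train[None, :] - x_eval[:, None]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def boosted_curve(x, y, grid, h, r):
    """Evaluate the r-step L2 boosted Nadaraya-Watson estimate on `grid`."""
    S, S_grid = nw_weights(x, x, h), nw_weights(grid, x, h)
    fit, curve = S @ y, S_grid @ y
    for _ in range(r):
        resid = y - fit
        fit, curve = fit + S @ resid, curve + S_grid @ resid
    return curve

def trapezoid(values, grid):
    """Trapezoidal rule for values given on an increasing grid."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)

def mise_model1(n, h, r, n_rep=200, seed=0):
    """Monte Carlo approximation of the MISE under model (1)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)          # 101 equally spaced grid points
    truth = np.sin(2.0 * np.pi * grid)
    ise = []
    for _ in range(n_rep):
        x = rng.uniform(0.0, 1.0, size=n)      # X_i ~ U(0, 1)
        y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.5, size=n)   # errors ~ N(0, 0.5^2)
        ise.append(trapezoid((boosted_curve(x, y, grid, h, r) - truth) ** 2, grid))
    return float(np.mean(ise))
```

For example, `mise_model1(100, 0.05, 0)` approximates the MISE of the Nadaraya-Watson estimate with bandwidth 0.05 for $n = 100$. The exact value depends on the simulated samples.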
Figure 1. (a) Integrated squared bias for r = 0, 1, ..., 6 based on 200 pseudo-samples of size 
n = 400 from the model (1). (b) Integrated squared bias for r = 0, 1, ..., 6 based on 200 
pseudo-samples of size n = 400 from the model (2). The left panel is for the L2 boosting estimate 
and the right panel is for the higher-order kernel estimate. 



Figures 1-3 show how the bias, variance and MISE of the estimates change as the 
boosting iteration number or the order of the kernel increases when n = 400. The result 
for r = 0 corresponds to the Nadaraya-Watson estimate. The curves in Figures 1 and 2 
depict the integrated squared biases (ISB) and the integrated variance (IV), respectively, 
as functions of the bandwidth, and those in Figure 3 represent MISE. Table 1 gives the 
minimal MISE along with the optimal bandwidths that attain the minimal values for 
both sample sizes n = 100 and 400. 
Figure 2. (a) Integrated variance for r = 0, 1, ..., 6 based on 200 pseudo-samples of size n = 400 
from the model (1). (b) Integrated variance for r = 0, 1, ..., 6 based on 200 pseudo-samples of 
size n = 400 from the model (2). The left panel is for the L2 boosting estimate and the right 
panel is for the higher-order kernel estimate. The horizontal axis of each panel is log(h). 

For the L2 boosted estimates, we see from the figures that the ISB reduces as the 
boosting iteration number r increases in the whole range of the bandwidth. In particular, 
it decreases rapidly at the beginning of the boosting iteration and the degree of reduction 
decreases as r increases. On the other hand, the IV increases at a relatively slower 
rate as r increases. Since the decrement of the ISB (as r increases) is greater than the 
increment of the IV for moderate-to-large bandwidths ($h > e^{-3.0} \approx 0.05$ for model (1) 
and $h > e^{-3.7} \approx 0.025$ for model (2)), and the former is smaller than the latter for small 
bandwidths, the value of MISE gets smaller as r increases in the range of moderate-to-large 
bandwidths, while it becomes larger in the range of small bandwidths. The results 
in Table 1 show that the minimal value of MISE always decreases and the optimal 
bandwidth gets larger as r increases. These results confirm our theoretical findings that 
L2 boosting improves the order of the bias while not deteriorating the order of the 
variance. 

Figure 3. (a) Mean integrated squared error for r = 0, 1, ..., 6 based on 200 pseudo-samples of 
size n = 400 from the model (1). (b) Mean integrated squared error for r = 0, 1, ..., 6 based on 
200 pseudo-samples of size n = 400 from the model (2). The left panel is for the L2 boosting 
estimate and the right panel is for the higher-order kernel estimate. 

Table 1. Minimal MISE and the corresponding optimal bandwidth h 

                        Model (1)                              Model (2) 
              L2 boosting     Higher-order kernel    L2 boosting     Higher-order kernel 
    n    r     h      MISE       h      MISE          h      MISE       h      MISE 
  100    0   0.050   0.0215    0.050   0.0215       0.030   0.0431    0.030   0.0431 
         1   0.080   0.0188    0.080   0.0208       0.045   0.0355    0.045   0.0436 
         2   0.100   0.0176    0.100   0.0213       0.060   0.0324    0.070   0.0493 
         3   0.120   0.0168    0.120   0.0223       0.065   0.0305    0.085   0.0544 
         4   0.130   0.0162    0.140   0.0231       0.075   0.0293    0.100   0.0588 
         5   0.140   0.0157    0.160   0.0238       0.080   0.0284    0.110   0.0624 
         6   0.150   0.0153    0.180   0.0242       0.085   0.0277    0.125   0.0649 
  400    0   0.040   0.0070    0.040   0.0070       0.020   0.0124    0.020   0.0124 
         1   0.060   0.0059    0.060   0.0066       0.035   0.0099    0.035   0.0118 
         2   0.080   0.0054    0.070   0.0068       0.045   0.0091    0.045   0.0125 
         3   0.090   0.0051    0.090   0.0072       0.055   0.0086    0.050   0.0134 
         4   0.100   0.0049    0.100   0.0075       0.060   0.0082    0.055   0.0143 
         5   0.110   0.0047    0.110   0.0077       0.065   0.0080    0.060   0.0150 
         6   0.120   0.0046    0.120   0.0079       0.070   0.0077    0.065   0.0157 

For the higher-order kernel estimates, the behavior of the ISB and the IV as the 
order of kernel r changes is similar to that of L2 boosting except for small bandwidths. 
For small bandwidths, not only the IV but also the ISB increases as r increases. 
This is contrary to the theory. In particular, the values of the ISB and IV 
explode when r is large. Although not presented in this paper, we observed that the 
bad behavior is more severe when n = 100 and it starts at a relatively larger bandwidth 
than in the case of n = 400. Furthermore, Table 1 reveals that the minimal 
value of MISE starts to increase at some point as the order of the kernel r increases. 
This erratic behavior of the higher-order kernel estimate is due to the fact that its 
denominator often takes near-zero or even negative values, which occurs more often for 
larger r, and it makes the estimate very unstable. This suggests that, contrary to L2 
boosting, the theoretical advantages of higher-order kernels do not take effect in practice. 
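
The source of this instability can be illustrated for the first higher-order kernel $K^{[1]} = 2K - K * K$ built from the Gaussian kernel, for which $K * K$ is the $N(0, 2)$ density. The sketch below is an illustration only, with hypothetical function names. It shows that $K^{[1]}$ takes negative values in its tails, so the denominator of the higher-order kernel estimate can be negative where the design points are sparse, whereas the Nadaraya-Watson denominator is always positive.

```python
import numpy as np

def gauss(u, sd=1.0):
    """Density of N(0, sd^2)."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def k1(u):
    """K^[1] = 2K - K*K for the Gaussian kernel K; K*K is the N(0, 2) density."""
    return 2.0 * gauss(u) - gauss(u, sd=np.sqrt(2.0))

def denominators(x_eval, x_train, h, kernel):
    """n^{-1} times the sum over i of kernel_h(X_i - x), for each x in x_eval."""
    u = (x_train[None, :] - x_eval[:, None]) / h
    return kernel(u).sum(axis=1) / (len(x_train) * h)
```

With a single design point at 0, bandwidth 0.1 and evaluation point 0.3, the kernel argument equals 3 and `denominators` with `k1` returns a negative value, while the same call with `gauss` returns a positive one.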